Cells occupy a hierarchy of transcriptional identities which is difficult to study in an unbiased manner when perturbed by disease. To identify, characterize, and compare clusters of cells, we present CATCH, a coarse graining framework that learns the cellular hierarchy by applying a deep cascade of manifold-intrinsic diffusion filters. CATCH includes a suite of tools based on the connection we forge between topological data analysis and data diffusion geometry to identify salient levels of the hierarchy, automatically characterize clusters and rapidly compute differentially expressed genes between clusters of interest. When used in conjunction with MELD (https://github.com/KrishnaswamyLab/MELD), CATCH has been shown to identify rare popultions of pathogenic cells and create robust disease signatures.
First, let's import CATCH along with all our other packages and proceed with analysis. If you haven't already, install the CATCH package using pip as follows (requirements located at https://github.com/KrishnaswamyLab/CATCH)
!pip install git+https://github.com/KrishnaswamyLab/CATCH
Collecting git+https://github.com/KrishnaswamyLab/CATCH Cloning https://github.com/KrishnaswamyLab/CATCH to /tmp/pip-req-build-96ni6qkf Running command git clone -q https://github.com/KrishnaswamyLab/CATCH /tmp/pip-req-build-96ni6qkf Resolved https://github.com/KrishnaswamyLab/CATCH to commit 803c7ddb166c43963d60bba9d43998a275ca5105 Preparing metadata (setup.py) ... done Requirement already satisfied: numpy in /gpfs/ysm/project/krishnaswamy_smita/mrk63/conda_envs/notebook_env/lib/python3.9/site-packages (from CATCH==0.0) (1.20.3) Requirement already satisfied: scipy in /gpfs/ysm/project/krishnaswamy_smita/mrk63/conda_envs/notebook_env/lib/python3.9/site-packages (from CATCH==0.0) (1.5.3) Requirement already satisfied: graphtools in /gpfs/ysm/project/krishnaswamy_smita/mrk63/conda_envs/notebook_env/lib/python3.9/site-packages (from CATCH==0.0) (1.5.2) Requirement already satisfied: phate in /gpfs/ysm/project/krishnaswamy_smita/mrk63/conda_envs/notebook_env/lib/python3.9/site-packages (from CATCH==0.0) (1.0.7) Requirement already satisfied: scikit-learn in /gpfs/ysm/project/krishnaswamy_smita/mrk63/conda_envs/notebook_env/lib/python3.9/site-packages (from CATCH==0.0) (1.0.1) Requirement already satisfied: tasklogger in /gpfs/ysm/project/krishnaswamy_smita/mrk63/conda_envs/notebook_env/lib/python3.9/site-packages (from CATCH==0.0) (1.1.0) Requirement already satisfied: joblib in /gpfs/ysm/project/krishnaswamy_smita/mrk63/conda_envs/notebook_env/lib/python3.9/site-packages (from CATCH==0.0) (1.1.0) Requirement already satisfied: pandas in /gpfs/ysm/project/krishnaswamy_smita/mrk63/conda_envs/notebook_env/lib/python3.9/site-packages (from CATCH==0.0) (1.3.4) Requirement already satisfied: future in /gpfs/ysm/project/krishnaswamy_smita/mrk63/conda_envs/notebook_env/lib/python3.9/site-packages (from graphtools->CATCH==0.0) (0.18.2) Requirement already satisfied: pygsp>=0.5.1 in /gpfs/ysm/project/krishnaswamy_smita/mrk63/conda_envs/notebook_env/lib/python3.9/site-packages (from graphtools->CATCH==0.0) (0.5.1) Requirement already satisfied: Deprecated in /gpfs/ysm/project/krishnaswamy_smita/mrk63/conda_envs/notebook_env/lib/python3.9/site-packages (from graphtools->CATCH==0.0) (1.2.13) Requirement already satisfied: threadpoolctl>=2.0.0 in /gpfs/ysm/project/krishnaswamy_smita/mrk63/conda_envs/notebook_env/lib/python3.9/site-packages (from scikit-learn->CATCH==0.0) (3.0.0) Requirement already satisfied: python-dateutil>=2.7.3 in /gpfs/ysm/project/krishnaswamy_smita/mrk63/conda_envs/notebook_env/lib/python3.9/site-packages (from pandas->CATCH==0.0) (2.8.2) Requirement already satisfied: pytz>=2017.3 in /gpfs/ysm/project/krishnaswamy_smita/mrk63/conda_envs/notebook_env/lib/python3.9/site-packages (from pandas->CATCH==0.0) (2021.3) Requirement already satisfied: s-gd2>=1.5 in /gpfs/ysm/project/krishnaswamy_smita/mrk63/conda_envs/notebook_env/lib/python3.9/site-packages (from phate->CATCH==0.0) (1.8) Requirement already satisfied: matplotlib>=3.0 in /gpfs/ysm/project/krishnaswamy_smita/mrk63/conda_envs/notebook_env/lib/python3.9/site-packages (from phate->CATCH==0.0) (3.5.0) Requirement already satisfied: scprep>=0.11.1 in /gpfs/ysm/project/krishnaswamy_smita/mrk63/conda_envs/notebook_env/lib/python3.9/site-packages (from phate->CATCH==0.0) (1.1.0) Requirement already satisfied: kiwisolver>=1.0.1 in /gpfs/ysm/project/krishnaswamy_smita/mrk63/conda_envs/notebook_env/lib/python3.9/site-packages (from matplotlib>=3.0->phate->CATCH==0.0) (1.3.2) Requirement already satisfied: packaging>=20.0 in /gpfs/ysm/project/krishnaswamy_smita/mrk63/conda_envs/notebook_env/lib/python3.9/site-packages (from matplotlib>=3.0->phate->CATCH==0.0) (21.0) Requirement already satisfied: fonttools>=4.22.0 in /gpfs/ysm/project/krishnaswamy_smita/mrk63/conda_envs/notebook_env/lib/python3.9/site-packages (from matplotlib>=3.0->phate->CATCH==0.0) (4.28.1) Requirement already satisfied: pillow>=6.2.0 in /gpfs/ysm/project/krishnaswamy_smita/mrk63/conda_envs/notebook_env/lib/python3.9/site-packages (from matplotlib>=3.0->phate->CATCH==0.0) (8.4.0) Requirement already satisfied: pyparsing>=2.2.1 in /gpfs/ysm/project/krishnaswamy_smita/mrk63/conda_envs/notebook_env/lib/python3.9/site-packages (from matplotlib>=3.0->phate->CATCH==0.0) (3.0.6) Requirement already satisfied: cycler>=0.10 in /gpfs/ysm/project/krishnaswamy_smita/mrk63/conda_envs/notebook_env/lib/python3.9/site-packages (from matplotlib>=3.0->phate->CATCH==0.0) (0.11.0) Requirement already satisfied: six>=1.5 in /gpfs/ysm/project/krishnaswamy_smita/mrk63/conda_envs/notebook_env/lib/python3.9/site-packages (from python-dateutil>=2.7.3->pandas->CATCH==0.0) (1.16.0) Requirement already satisfied: decorator>=4.3.0 in /gpfs/ysm/project/krishnaswamy_smita/mrk63/conda_envs/notebook_env/lib/python3.9/site-packages (from scprep>=0.11.1->phate->CATCH==0.0) (5.1.0) Requirement already satisfied: wrapt<2,>=1.10 in /gpfs/ysm/project/krishnaswamy_smita/mrk63/conda_envs/notebook_env/lib/python3.9/site-packages (from Deprecated->graphtools->CATCH==0.0) (1.12.1)
import numpy as np
import pandas as pd
import graphtools
from CATCH import catch
import matplotlib.pyplot as plt
import scprep
import phate
import sklearn
import tasklogger
import collections
import warnings
from collections import defaultdict
from scipy.spatial.distance import pdist, cdist, squareform
warnings.simplefilter("ignore")
In this section we will download 10X human Peripheral Blood Mononuclear Cell (PBMC) data to your local computer and pre-process it for analysis with CATCH.
import os
data_dir = os.path.expanduser("~/catch_data") # enter path to data directory here (this is where you want to save 10X data)
if not os.path.isdir(data_dir):
os.mkdir(data_dir)
file_name = '10X_pbmc_data.h5'
file_path = os.path.join(data_dir, file_name)
URL = 'https://cf.10xgenomics.com/samples/cell-exp/2.1.0/pbmc4k/pbmc4k_raw_gene_bc_matrices_h5.h5'
scprep.io.download.download_url(URL, file_path)
data = scprep.io.load_10X_HDF5(file_path, gene_labels='both')
data.head()
| RP11-34P13.3 (ENSG00000243485) | FAM138A (ENSG00000237613) | OR4F5 (ENSG00000186092) | RP11-34P13.7 (ENSG00000238009) | RP11-34P13.8 (ENSG00000239945) | RP11-34P13.14 (ENSG00000239906) | RP11-34P13.9 (ENSG00000241599) | FO538757.3 (ENSG00000279928) | FO538757.2 (ENSG00000279457) | AP006222.2 (ENSG00000228463) | ... | AC007325.2 (ENSG00000277196) | BX072566.1 (ENSG00000277630) | AL354822.1 (ENSG00000278384) | AC023491.2 (ENSG00000278633) | AC004556.1 (ENSG00000276345) | AC233755.2 (ENSG00000277856) | AC233755.1 (ENSG00000275063) | AC240274.1 (ENSG00000271254) | AC213203.1 (ENSG00000277475) | FAM231B (ENSG00000268674) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AAACCTGAGAAACCAT-1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| AAACCTGAGAAACCGC-1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| AAACCTGAGAAACCTA-1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| AAACCTGAGAAACGAG-1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| AAACCTGAGAAACGCC-1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 33694 columns
Now that we have loaded the data, we will remove cells with low transcript counts (less than 1000 counts per cell) and unexpressed genes (expressed in less than 5 cells):
scprep.plot.plot_library_size(data, cutoff=1500)
<AxesSubplot:xlabel='Library size', ylabel='Number of cells'>
data = scprep.filter.filter_library_size(data, cutoff=1500, keep_cells='above')
data = scprep.filter.filter_rare_genes(data)
data.shape
(4298, 15401)
Finally, we will library size normalize and square root transform the expression data as is standard in single cell analysis:
data_norm, libsize = scprep.normalize.library_size_normalize(data, return_library_size=True)
data_sqrt = scprep.transform.sqrt(data_norm)
data_sqrt.head()
| RP11-34P13.7 (ENSG00000238009) | FO538757.2 (ENSG00000279457) | AP006222.2 (ENSG00000228463) | RP4-669L17.10 (ENSG00000237094) | RP11-206L10.9 (ENSG00000237491) | LINC00115 (ENSG00000225880) | FAM41C (ENSG00000230368) | NOC2L (ENSG00000188976) | KLHL17 (ENSG00000187961) | PLEKHN1 (ENSG00000187583) | ... | MT-ND6 (ENSG00000198695) | MT-CYB (ENSG00000198727) | BX004987.4 (ENSG00000278704) | AC145212.2 (ENSG00000274847) | AC011043.1 (ENSG00000276256) | AL592183.1 (ENSG00000273748) | AC007325.4 (ENSG00000278817) | AL354822.1 (ENSG00000278384) | AC004556.1 (ENSG00000276345) | AC240274.1 (ENSG00000271254) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AAACCTGAGAAGGCCT-1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | ... | 0.0 | 4.154662 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| AAACCTGAGACAGACC-1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | ... | 0.0 | 6.573422 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| AAACCTGAGATAGTCA-1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | ... | 0.0 | 6.449217 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| AAACCTGAGCGCCTCA-1 | 0.0 | 0.0 | 0.0 | 0.0 | 2.077481 | 0.0 | 0.0 | 2.077481 | 0.0 | 0.0 | ... | 0.0 | 5.876003 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| AAACCTGAGGCATGGT-1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | ... | 0.0 | 3.661874 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 15401 columns
Next, we'll run the PHATE (https://www.nature.com/articles/s41587-019-0336-3), a dimensionality reduction algorithm previously developed by the Krishnaswamy lab to produce powerful visualizations.
phate_op = phate.PHATE()
data_phate = phate_op.fit_transform(data_sqrt)
scprep.plot.scatter2d(
data_phate,
fontsize=16,
s=50,
legend=False,
ticks=False,
label_prefix="PHATE",
figsize=(8, 6),
)
Calculating PHATE...
Running PHATE on 4298 observations and 15401 variables.
Calculating graph and diffusion operator...
Calculating PCA...
Calculated PCA in 10.85 seconds.
Calculating KNN search...
Calculated KNN search in 3.39 seconds.
Calculating affinities...
Calculated affinities in 0.66 seconds.
Calculated graph and diffusion operator in 16.10 seconds.
Calculating landmark operator...
Calculating SVD...
Calculated SVD in 0.59 seconds.
Calculating KMeans...
Calculated KMeans in 11.86 seconds.
Calculated landmark operator in 14.72 seconds.
Calculating optimal t...
Automatically selected t = 16
Calculated optimal t in 5.68 seconds.
Calculating diffusion potential...
Calculated diffusion potential in 1.20 seconds.
Calculating metric MDS...
Calculated metric MDS in 9.21 seconds.
Calculated PHATE in 46.98 seconds.
<AxesSubplot:xlabel='PHATE1', ylabel='PHATE2'>
First, we run the CATCH condensation process on the data as follows:
catch_op = catch.CATCH(knn=30, random_state=18, n_pca=50, n_jobs=1)
catch_op.fit(data_sqrt.sparse.to_dense())
Calculating PCA... Calculated PCA in 7.97 seconds. Calculating Diffusion Condensation... Calculating Condensation Parameters... Calculated Condensation Parameters in 1.21 seconds. Calculated Diffusion Condensation in 122.50 seconds.
Next, we identify granularities for downstream analysis using topological activity analysis:
levels = catch_op.transform()
Calculating Topological Activity... Calculated Topological Activity in 0.01 seconds.
We can now visualize the topological activity and identify granularities which partition single cells into meaningul, stable clusters:
print(levels)
plt.plot(catch_op.gradient)
plt.xlabel("Iterations")
plt.ylabel("Topological Activity")
plt.scatter(len(catch_op.NxTs)+levels, catch_op.gradient[levels+1], c='r')
plt.show()
[-121 -114 -109 -104 -99 -96 -88 -84 -81 -76 -73 -65 -58 -56 -41 -38 -32 -17 -11]
Finally, we can visualize the CATCH computed clusters on our initial PHATE embedding across stable granularities:
fig, axes = plt.subplots(2,2, figsize=(12, 8))
for i, ax in enumerate(axes.flatten()):
scprep.plot.scatter2d(data_phate, c=catch_op.NxTs[levels[-2*i-1]], legend_anchor=(1,1), ax=ax,
title='Granularity '+str(len(catch_op.NxTs)+levels[-2*i-1]),
xticks=False, yticks=False, label_prefix="PHATE", fontsize=10, s=3)
fig.tight_layout()
Now we can compute the condensation homology and visualize some of these stable clusters on the tree to help us identify optimal granularities levels for downstream analysis:
tree = catch_op.build_tree()
for i in range(4):
tree_clusters = catch_op.map_clusters_to_tree(catch_op.NxTs[levels[-2*i-1]])
ax = scprep.plot.scatter3d(tree, c = tree_clusters, title = 'Granularity '+str(len(catch_op.NxTs)+levels[-2*i-1]),
fontsize=16, s = 20, legend = False,
ticks=False, label_prefix="PHATE", figsize=(6,18))
ax.set_axis_off()
We often find that rotating the condensation homology plot allows for the best visualization of cluster seperation:
tree_clusters = catch_op.map_clusters_to_tree(catch_op.NxTs[98])
scprep.plot.rotate_scatter3d(tree, c = tree_clusters, title = 'Granularity 98',
fontsize=16, s = 20, legend = False,
ticks=False, label_prefix=None, figsize=(12,12))